Add atomic_ref support for 8 and 16b types. #2255
base: main
Conversation
🟩 CI finished in 4h 38m: Pass: 100%/417 | Total: 3d 08h | Avg: 11m 37s | Max: 1h 16m | Hits: 79%/34092

| +/- | Project |
|---|---|
|     | CCCL Infrastructure |
| +/- | libcu++ |
|     | CUB |
|     | Thrust |
|     | CUDA Experimental |
|     | pycuda |

Modifications in project or dependencies?

| +/- | Project |
|---|---|
| +/- | CCCL Infrastructure |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| +/- | CUDA Experimental |
| +/- | pycuda |

🏃 Runner counts (total jobs: 417)

| # | Runner |
|---|---|
| 305 | linux-amd64-cpu16 |
| 61 | linux-amd64-gpu-v100-latest-1 |
| 28 | linux-arm64-cpu16 |
| 23 | windows-amd64-cpu16 |
Resolved (outdated) review thread on libcudacxx/include/cuda/std/__atomic/functions/cuda_ptx_derived.h
🟨 CI finished in 7h 41m: Pass: 98%/417 | Total: 2d 16h | Avg: 9m 13s | Max: 43m 42s | Hits: 97%/34092

| +/- | Project |
|---|---|
|     | CCCL Infrastructure |
| +/- | libcu++ |
|     | CUB |
|     | Thrust |
|     | CUDA Experimental |
|     | pycuda |

Modifications in project or dependencies?

| +/- | Project |
|---|---|
| +/- | CCCL Infrastructure |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| +/- | CUDA Experimental |
| +/- | pycuda |

🏃 Runner counts (total jobs: 417)

| # | Runner |
|---|---|
| 305 | linux-amd64-cpu16 |
| 61 | linux-amd64-gpu-v100-latest-1 |
| 28 | linux-arm64-cpu16 |
| 23 | windows-amd64-cpu16 |
🟨 CI finished in 3h 02m: Pass: 99%/421 | Total: 2d 14h | Avg: 8m 54s | Max: 1h 12m | Hits: 98%/34092

| +/- | Project |
|---|---|
| +/- | CCCL Infrastructure |
| +/- | libcu++ |
|     | CUB |
|     | Thrust |
|     | CUDA Experimental |
|     | pycuda |

Modifications in project or dependencies?

| +/- | Project |
|---|---|
| +/- | CCCL Infrastructure |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| +/- | CUDA Experimental |
| +/- | pycuda |

🏃 Runner counts (total jobs: 421)

| # | Runner |
|---|---|
| 305 | linux-amd64-cpu16 |
| 65 | linux-amd64-gpu-v100-latest-1 |
| 28 | linux-arm64-cpu16 |
| 23 | windows-amd64-cpu16 |
🟨 CI finished in 13h 52m: Pass: 94%/421 | Total: 2d 06h | Avg: 7m 43s | Max: 53m 08s | Hits: 94%/34092

| +/- | Project |
|---|---|
| +/- | CCCL Infrastructure |
| +/- | libcu++ |
|     | CUB |
|     | Thrust |
|     | CUDA Experimental |
|     | pycuda |

Modifications in project or dependencies?

| +/- | Project |
|---|---|
| +/- | CCCL Infrastructure |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| +/- | CUDA Experimental |
| +/- | pycuda |

🏃 Runner counts (total jobs: 421)

| # | Runner |
|---|---|
| 305 | linux-amd64-cpu16 |
| 65 | linux-amd64-gpu-v100-latest-1 |
| 28 | linux-arm64-cpu16 |
| 23 | windows-amd64-cpu16 |
Resolved (outdated) review thread on libcudacxx/include/cuda/std/__atomic/functions/cuda_ptx_derived.h
🟨 CI finished in 6h 55m: Pass: 98%/421 | Total: 2d 07h | Avg: 7m 52s | Max: 56m 07s | Hits: 98%/34092

| +/- | Project |
|---|---|
| +/- | CCCL Infrastructure |
| +/- | libcu++ |
|     | CUB |
|     | Thrust |
|     | CUDA Experimental |
|     | pycuda |

Modifications in project or dependencies?

| +/- | Project |
|---|---|
| +/- | CCCL Infrastructure |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| +/- | CUDA Experimental |
| +/- | pycuda |

🏃 Runner counts (total jobs: 421)

| # | Runner |
|---|---|
| 305 | linux-amd64-cpu16 |
| 65 | linux-amd64-gpu-v100-latest-1 |
| 28 | linux-arm64-cpu16 |
| 23 | windows-amd64-cpu16 |
🟨 CI finished in 12h 23m: Pass: 95%/437 | Total: 2d 22h | Avg: 9m 40s | Max: 59m 39s | Hits: 97%/41584

| +/- | Project |
|---|---|
| +/- | CCCL Infrastructure |
| +/- | libcu++ |
|     | CUB |
|     | Thrust |
|     | CUDA Experimental |
|     | pycuda |
|     | CUDA C Core Library |

Modifications in project or dependencies?

| +/- | Project |
|---|---|
| +/- | CCCL Infrastructure |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| +/- | CUDA Experimental |
| +/- | pycuda |
| +/- | CUDA C Core Library |

🏃 Runner counts (total jobs: 437)

| # | Runner |
|---|---|
| 320 | linux-amd64-cpu16 |
| 66 | linux-amd64-gpu-v100-latest-1 |
| 28 | linux-arm64-cpu16 |
| 23 | windows-amd64-cpu16 |
__dst = __cuda_atomic_fetch_update(
  __ptr,
  [__op](_Type __old) {
    return __old < __op ? __old : __op;
I think that, in general (not sure what `_Type`s we support here as an extension):

- `fetch_min(op)` should be `op < old ? op : old`: assuming `old` (the value in memory, i.e. the atomic object's value) is the "first argument", `std::min` requires the first argument (`old`) to be returned when the arguments compare equivalent.
- For a similar reason, `fetch_max(op)` should be `old < op ? op : old`.
@miscco @bernhardmgruber may know the details.
I am slightly concerned about changing runtime behavior, but on the other hand that is the more correct one.
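For concreteness, a minimal sketch of the comparator semantics being proposed (hypothetical helper names; the PR's actual code routes through `__cuda_atomic_fetch_update`):

```cpp
// Sketch: comparators with std::min/std::max tie-breaking, where __old is
// the value currently in memory and __op is the operand.

// fetch_min: std::min(old, op) returns its first argument (__old) when the
// two compare equivalent, so __op is taken only when it is strictly smaller.
template <class _Type>
__device__ _Type __fetch_min_operand(_Type __old, _Type __op)
{
  return __op < __old ? __op : __old;
}

// fetch_max: std::max(old, op) likewise returns __old on ties, so __op is
// taken only when __old is strictly smaller.
template <class _Type>
__device__ _Type __fetch_max_operand(_Type __old, _Type __op)
{
  return __old < __op ? __op : __old;
}
```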
static inline _CCCL_DEVICE void
__cuda_atomic_store(_Type* __ptr, _Type __val, _Order, _Operand, _Sco, __atomic_cuda_mmio_disable)
{
  // Store requires cas on 8/16b types
Why does store require CAS on 8 and 16b types? PTX atomic store (`st.relaxed`, `st.release`) supports `.b8`, `.u8`, `.s8`, `.b16`, `.u16`, `.s16`:
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-st
I assumed emulation was needed for load and store. More to the point, there is no PTX inline-assembly constraint for 8b values. I'll evaluate using 16b ld/st.
I've updated the PR to use 16b ld/st. Making use of 8b would require dispatching to assembly that uses a widened proxy for output/input; that needs adjustments I don't want to fold into this PR. We can do it later.
> More to the point, there is no PTX inline-assembly constraint for 8b values.

Right, libcu++ needs to zero-extend the 8-bit value to a 16-bit value, pass the 16-bit value to PTX, and then pass that directly to the 8-bit load/store ops. Doing it later is fine.
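As a rough illustration of that widened-proxy idea (a hypothetical function, not the PR's code): NVCC inline assembly has no 8-bit register constraint, so the byte is zero-extended into a 16-bit register (the `"h"` constraint) and `st.*.b8` stores only its low byte. The `.relaxed` qualifier assumes sm_70+.

```cpp
#include <cstdint>

// Sketch: relaxed 8-bit store via a 16-bit proxy register. PTX has no 8-bit
// registers, so st.relaxed.gpu.b8 takes a 16-bit source register and writes
// only the low byte to memory.
__device__ void store_relaxed_b8(std::uint8_t* __ptr, std::uint8_t __val)
{
  std::uint16_t __proxy = __val; // zero-extend 8 -> 16 bits
  asm volatile("st.relaxed.gpu.b8 [%0], %1;"
               :
               : "l"(__ptr), "h"(__proxy)
               : "memory");
}
```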
Resolved (outdated) review thread on libcudacxx/include/cuda/std/__atomic/functions/cuda_ptx_derived.h
🟨 CI finished in 3h 50m: Pass: 99%/437 | Total: 2d 22h | Avg: 9m 41s | Max: 1h 20m | Hits: 90%/41584

| +/- | Project |
|---|---|
| +/- | CCCL Infrastructure |
| +/- | libcu++ |
|     | CUB |
|     | Thrust |
|     | CUDA Experimental |
|     | pycuda |
|     | CUDA C Core Library |

Modifications in project or dependencies?

| +/- | Project |
|---|---|
| +/- | CCCL Infrastructure |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| +/- | CUDA Experimental |
| +/- | pycuda |
| +/- | CUDA C Core Library |

🏃 Runner counts (total jobs: 437)

| # | Runner |
|---|---|
| 320 | linux-amd64-cpu16 |
| 66 | linux-amd64-gpu-v100-latest-1 |
| 28 | linux-arm64-cpu16 |
| 23 | windows-amd64-cpu16 |
Description

This enables the `atomic_ref` APIs to begin accepting 8 and 16b types. These types are emulated with 32-bit-wide CAS loops. Performance is terrible, and program correctness depends entirely on whether the surrounding memory is valid and atomically accessed.
closes #2051
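For readers unfamiliar with the technique, here is a standalone sketch of a 32-bit CAS-loop emulation for one 8-bit operation (illustrative only; it uses plain `atomicCAS` rather than the library's internal dispatch, and the function name is made up):

```cpp
#include <cstdint>

// Sketch: emulate an 8-bit atomic exchange by CAS-looping on the aligned
// 32-bit word containing the byte (CUDA GPUs are little-endian, so the byte
// offset maps directly to a shift). As noted above, correctness depends on
// the other bytes of that word being valid, atomically accessed memory.
__device__ std::uint8_t exchange_u8(std::uint8_t* ptr, std::uint8_t desired)
{
  const auto addr      = reinterpret_cast<std::uintptr_t>(ptr);
  auto* word           = reinterpret_cast<unsigned int*>(addr & ~std::uintptr_t{3});
  const unsigned shift = static_cast<unsigned>(addr & 3) * 8;
  const unsigned mask  = 0xFFu << shift;

  unsigned old = *word;
  unsigned assumed;
  do
  {
    assumed = old;
    // Splice the new byte into its lane, leaving the other three bytes as-is.
    const unsigned next = (assumed & ~mask) | (static_cast<unsigned>(desired) << shift);
    old = atomicCAS(word, assumed, next);
  } while (old != assumed); // retry if another thread changed the word

  return static_cast<std::uint8_t>((old >> shift) & 0xFFu);
}
```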
Checklist